library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(modelr)
library(hexbin)
options(na.action = na.warn)
data(diamonds)

Your client, the world’s only ethical diamond miner has asked you to build a model predicting diamond prices based on the four C’s: Carats (weight), Cut, Color, Clarity. You recognize that diamonds are a hoax, but a gig’s a gig so you have accepted the contract.

  1. Recall that the dataset diamonds is part of the tidyverse package and use ggplot to examine the relationships between price and cut, color, and clarity, respectively. Since those three C’s are categorical, geom_boxplot might be handy.
  1. Price versus cut
  2. Price versus clarity
  3. Price versus color
  4. Note any surprises or counterintuitive results.
#a
ggplot(diamonds, aes(cut, price)) + geom_boxplot()

#b
ggplot(diamonds, aes(clarity, price)) + geom_boxplot()

#c
ggplot(diamonds, aes(color, price)) + geom_boxplot()

#d
From the plot, we can see that generally lower quality diamonds often have higher prices which is very suprising.
  1. Carat is numerical, so use a different geom, geom_point to view Price versus Carat. There is clearly a relationship, but it is hard to see because there are so many observations in the data-set, over 50,000. We will use geom_hex to cram more observations into each plotted point.
  1. Do ‘dia <- ggplot(diamonds, aes(carat, price))’
  2. Now try ‘dia + geom_hex()’ and ‘dia + geom_hes(bins = 10)’. Note that the result is a heat map with a default number of bins.
  3. When we chose “bins = 10”, we got a lot more than 10 bins. What happened?
  4. Mess with the number of bins until you have a nice readable plot.
  5. Also, notice the syntax. We assigned a function to the variable dia and then used it as the function so we didn’t have to keep typing/copying it.
#a
dia <- ggplot(diamonds, aes(carat, price))

#b
dia + geom_hex()

dia + geom_hex(bins = 10)

#d
dia + geom_hex(bins = 20)

dia + geom_hex(bins = 30)

dia + geom_hex(bins = 40)

dia + geom_hex(bins = 50)

#c
The "bins" is the numeric vector giving number of bins in both vertical and horizontal directions. 
"bins= 10" gives 10 bins in both vertical and horizontal directions.
In addition, it also contains many outliers in the plot.
  1. It turns out all but 0.3% of the diamonds are below 2 carats, so filter out all the diamonds that are above that size and look at Price versus carat again. Call the new data-set dia_red (diamonds reduced) or something similar.
  1. We can see that the relationship is not linear. Hadley suggests that we look at a double log-transform of the data, so define log_pr and log_car to be of price and carat, respectively and add them to the data set. In practice, we might try several transforms, but data involving monetary amounts or populations often exhibit exponential characteristics, so a log transformation is reasonable. The command for obtaining of price is ‘log2(price)’. Note that the base of the logarithm is unimportant. Both e and 10 are common choices for the base.
  2. Plot the new variables and see if a linear model appears plausible. Don’t forget our new geom_hex friend if the plot looks too busy. If plausible, fit a linear model minimizing the usual norm on the residuals.
dia_red <- filter(diamonds, carat <= 2)
dia_red
## # A tibble: 52,051 x 10
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 52,041 more rows
#a
dia_2 <- diamonds %>% filter(carat <= 2) %>% mutate(log_pr = log2(price), log_car = log2(carat))
dia_2
## # A tibble: 52,051 x 12
##    carat       cut color clarity depth table price     x     y     z
##    <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
##  2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
##  3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
##  4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
##  5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good     I    VVS1  62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good     H     SI1  61.9    55   337  4.07  4.11  2.53
##  9  0.22      Fair     E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338  4.00  4.05  2.39
## # ... with 52,041 more rows, and 2 more variables: log_pr <dbl>,
## #   log_car <dbl>
#b
ggplot(dia_2, aes(log_car, log_pr)) + geom_hex(bins = 50)

lm_dia2 <- lm(log_pr~log_car, data = dia_2)
lm_dia2
## 
## Call:
## lm(formula = log_pr ~ log_car, data = dia_2)
## 
## Coefficients:
## (Intercept)      log_car  
##      12.209        1.697
  1. We could look at the line with our log-transformed scatterplot but it is more interesting to see the un-logged curve with the original data.
  2. Use modelr to create a one-dimensional ‘grid’ of 20 carat values spread evenly along the data-set values of carat in dia_red. In fact, the command is below, but say what each line is doing:
        grid <- dia_2 %>% 
        data_grid(carat = seq_range(carat, 20)) %>% 
            mutate(log_car = log2(carat)) %>% 
            add_predictions(lm_dia2, "log_pr") %>% 
        mutate(price = 2 ^ log_pr)
  1. Now repeat your plot of price versus carat from before Part (a) and use geom_line to to plot the curve. Here is the command, I apologize that it is called geom_line even though it is not a line. Explain to your partner what each line of the command is doing:
ggplot(dia_2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, colour = "red", size = 1)

  1. Since the residuals were minimized for the logged data, we should plot those residuals to see if a linear model was appropriate for the log-transformed data. Make a plot or two of the ‘log_resids’ and comment on it.
dia_2 <- dia_2 %>% 
  add_residuals(lm_dia2, "log_resid")

ggplot(dia_2, aes(log_car, log_resid)) + 
  geom_hex(bins = 50)

plot(lm_dia2,which=1:2)

  1. Fitting the residuals – We can think of the ‘log_resid’ as the patterns left in the data after the linear dependence on log_car is removed. While the residuals should appear random as a function of carat, there is no reason they should be independent of the other three variables.
  1. Plot the residuals against each of cut, clarity, and color and comment on the results. Are the associations as expected? Explain.
#a
ggplot(dia_2, aes(cut, log_resid)) + geom_boxplot()

ggplot(dia_2, aes(color, log_resid)) + geom_boxplot()

ggplot(dia_2, aes(clarity, log_resid)) + geom_boxplot()

The associations are as expected. As the quality of the diamond increases, it tends to become more relative to price. 
  1. Find the least squares fit of ‘log_pr’ against ‘log_car’ and the other three variables. (We are still using the reduced data set.) So this model has one response and 4 predictors.
  2. You can look at the summary of the model, but it is a mess. However, it is worth noting that we can explain 98.15% of the variation in price with these 4 variables. This is a significant achievement (pun intended).
#b
lm_dia2 <- lm(log_pr ~ log_car + color + cut + clarity, data = dia_2)

#c
summary(lm_dia2)
## 
## Call:
## lm(formula = log_pr ~ log_car + color + cut + clarity, data = dia_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10924 -0.12391 -0.00224  0.11703  2.74693 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  1.221e+01  1.765e-03 6922.181  < 2e-16 ***
## log_car      1.889e+00  1.170e-03 1614.839  < 2e-16 ***
## color.L     -6.353e-01  2.991e-03 -212.384  < 2e-16 ***
## color.Q     -1.362e-01  2.774e-03  -49.103  < 2e-16 ***
## color.C     -1.994e-02  2.586e-03   -7.712 1.26e-14 ***
## color^4      1.998e-02  2.364e-03    8.454  < 2e-16 ***
## color^5      6.179e-05  2.215e-03    0.028   0.9777    
## color^6      4.488e-03  1.993e-03    2.252   0.0243 *  
## cut.L        1.727e-01  3.459e-03   49.923  < 2e-16 ***
## cut.Q       -4.770e-02  3.040e-03  -15.692  < 2e-16 ***
## cut.C        1.571e-02  2.623e-03    5.990 2.11e-09 ***
## cut^4       -3.070e-03  2.092e-03   -1.467   0.1423    
## clarity.L    1.293e+00  5.290e-03  244.403  < 2e-16 ***
## clarity.Q   -3.148e-01  4.965e-03  -63.396  < 2e-16 ***
## clarity.C    1.618e-01  4.238e-03   38.168  < 2e-16 ***
## clarity^4   -7.572e-02  3.363e-03  -22.515  < 2e-16 ***
## clarity^5    2.873e-02  2.710e-03   10.600  < 2e-16 ***
## clarity^6    6.082e-04  2.336e-03    0.260   0.7946    
## clarity^7    4.791e-02  2.055e-03   23.314  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1905 on 52032 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9815 
## F-statistic: 1.535e+05 on 18 and 52032 DF,  p-value: < 2.2e-16
  1. Back to the full model with 4 predictors (log_car, cut, clarity, color), Make a plot of the residuals using ‘add_residuals’ and geom_hex. Things look ok, but there are some serious outliers.
dia_2 <- dia_2 %>% 
  add_residuals(lm_dia2, "log_resid2")

ggplot(dia_2, aes(log_car, log_resid2)) + 
  geom_hex(bins = 50)

a. Find the outliers that represent great buying or selling opportunities. (You don’t have to find the data points, just identify the right residuals.)

There are some diamonds with quite large residuals 
  1. If we bought the ‘underpriced’ diamonds for their indicated price and then sold them for a more average price (residual = 0), how much money would we expect to make.
dia_2 %>% 
filter(dia_2$log_resid2 < 0) %>%
add_predictions(lm_dia2) %>%
mutate(pred = (2 ^ pred)) %>%
select(price, pred) %>%
mutate(difference= pred - price) %>% 
summarise(sum(difference))
## # A tibble: 1 x 1
##   `sum(difference)`
##               <dbl>
## 1           7579811

Hopefully you can see why identifying outliers can be interesting for those motivated by money.

  1. The textbook gives a great set of commands to actually find the mispriced diamonds. If we can find something consistent about them, we have found a financial opportunity.
Opportunity <- dia_2 %>% 
            filter(abs(log_resid2) > 1) %>% 
            add_predictions(lm_dia2) %>% 
            mutate(pred = round(2 ^ pred)) %>% 
            select(price, pred, carat:table, x:z) %>% 
            arrange(price)

Explain each line of this model and try to find a reason for mispricing.

# log(price/predictPrice) >1; find diamonds with indicated price higher than average price
# get prediction
# select and then rank from lowest to highest

Exercises

  1. In the plot of log_pr vs log_car, there are some bright vertical strips. What is causing those?

    The bright vertical strips mean that there are some variations on `log_pr` although there seems to be a linear relationship between `log_pr` and `log_car`.
  2. If, what does that say about the relationship between price and carat? (Use any resource. If anyone is a chemistry major, they may know off-hand).

    They have a positive correlation. If carat is increased by 1%, it will cause an b1% increase in price.
  3. Does the final, full model do a good job of predicting diamond prices? Should your client trust it when buying diamonds or setting diamond prices? Give some justification for your answer.

summary(lm_dia2)
## 
## Call:
## lm(formula = log_pr ~ log_car + color + cut + clarity, data = dia_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.10924 -0.12391 -0.00224  0.11703  2.74693 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  1.221e+01  1.765e-03 6922.181  < 2e-16 ***
## log_car      1.889e+00  1.170e-03 1614.839  < 2e-16 ***
## color.L     -6.353e-01  2.991e-03 -212.384  < 2e-16 ***
## color.Q     -1.362e-01  2.774e-03  -49.103  < 2e-16 ***
## color.C     -1.994e-02  2.586e-03   -7.712 1.26e-14 ***
## color^4      1.998e-02  2.364e-03    8.454  < 2e-16 ***
## color^5      6.179e-05  2.215e-03    0.028   0.9777    
## color^6      4.488e-03  1.993e-03    2.252   0.0243 *  
## cut.L        1.727e-01  3.459e-03   49.923  < 2e-16 ***
## cut.Q       -4.770e-02  3.040e-03  -15.692  < 2e-16 ***
## cut.C        1.571e-02  2.623e-03    5.990 2.11e-09 ***
## cut^4       -3.070e-03  2.092e-03   -1.467   0.1423    
## clarity.L    1.293e+00  5.290e-03  244.403  < 2e-16 ***
## clarity.Q   -3.148e-01  4.965e-03  -63.396  < 2e-16 ***
## clarity.C    1.618e-01  4.238e-03   38.168  < 2e-16 ***
## clarity^4   -7.572e-02  3.363e-03  -22.515  < 2e-16 ***
## clarity^5    2.873e-02  2.710e-03   10.600  < 2e-16 ***
## clarity^6    6.082e-04  2.336e-03    0.260   0.7946    
## clarity^7    4.791e-02  2.055e-03   23.314  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1905 on 52032 degrees of freedom
## Multiple R-squared:  0.9815, Adjusted R-squared:  0.9815 
## F-statistic: 1.535e+05 on 18 and 52032 DF,  p-value: < 2.2e-16
plot(lm_dia2)

dia_2 %>% 
  add_predictions(lm_dia2) %>%
  add_residuals(lm_dia2) %>%
  summarise(sq_err = sqrt(mean(resid^2)),
            abs_err = mean(abs(resid)),
            p975_err = quantile(resid, 0.975),
            p025_err = quantile(resid, 0.025))
## # A tibble: 1 x 4
##      sq_err   abs_err  p975_err   p025_err
##       <dbl>     <dbl>     <dbl>      <dbl>
## 1 0.1904401 0.1480822 0.3855161 -0.3617156
From the plot, I think clients should trust it when buying diamonds because the full model is very good.